Add limits to the size of the string repetition multiplier #23561

Open: wants to merge 2 commits into base: blead

Conversation

richardleach
Contributor

Historically, given a statement like my $x = "A" x SOMECONSTANT;, no examination of the size of the multiplier (SOMECONSTANT in this example) was done at compile time. Depending upon the constant folding behaviour, this might mean:

  • The buffer allocation needed at runtime could be clearly bigger than the system can support, but Perl would happily compile the statement and let the author find this out at runtime.
  • Constants resulting from folding could be very large and the memory taken up undesirable, especially in cases where the constant resides in cold code.

This commit adds some compile time checking such that:

  • A string size beyond or close to the likely limit of support triggers a fatal error.
  • Strings above a certain static size do not get constant folded.

Things could obviously still go bad at runtime when the multiplier isn't a simple
constant, but that's not for this PR.
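
As a rough sketch of the two compile-time outcomes listed above, the decision can be pictured as a three-way classification of the constant multiplier. The names here (`repeat_verdict`, `classify_repeat`, `FOLD_LIMIT`, `UNREALISTIC_LIMIT`) are hypothetical and not taken from the patch. The values `1024 * 1024` and `SIZE_MAX >> 2` appear in the diff snippets under review, but treating the first as the folding cut-off, and comparing against the count rather than the resulting string size, are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the real patch works on OP trees and SVs. */
typedef enum {
    REPEAT_FOLD,        /* small enough to constant fold at compile time */
    REPEAT_DEFER,       /* plausible, but too big to fold: build at runtime */
    REPEAT_UNREALISTIC  /* beyond what the platform could plausibly support */
} repeat_verdict;

/* Both values are described in the PR as arbitrary and open to discussion.
 * Assumption: 1024 * 1024 is the folding cut-off. */
#define FOLD_LIMIT        ((size_t)1024 * 1024)
#define UNREALISTIC_LIMIT (SIZE_MAX >> 2)

/* Classify a constant repetition count (assumed non-negative). */
static repeat_verdict classify_repeat(size_t count)
{
    if (count > UNREALISTIC_LIMIT)
        return REPEAT_UNREALISTIC;
    if (count > FOLD_LIMIT)
        return REPEAT_DEFER;
    return REPEAT_FOLD;
}
```

Under these assumed thresholds, a count such as 50_000_000 would be left for runtime rather than folded, and only counts above `SIZE_MAX >> 2` would trigger the diagnostic.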

Closes #13324, closes #13793, closes #20586.

Besides general correctness checking, the arbitrary cut-off numbers are up for discussion, and please could reviewers suggest any improvements to the perldiag.pod that come to mind.


  • This set of changes requires a perldelta entry, and I will write one post-bikeshedding.

@richardleach richardleach force-pushed the hydahy/const_fold_repeatmax branch from 953cf3f to 3725af9 Compare August 10, 2025 21:19
@guest20

guest20 commented Aug 10, 2025

One of the mottos of perl, at least when I picked up the book, was "no internal limits".

Just to get them out of the way, I'm going to front load these:

  • "back in my day we walked 10 miles up hill in the snow both ways"
  • "no airbags, we die like men"
  • "sounds like a skill issue"
  • "What's next, do I have to get a licence to make toast in my own goddamn toaster?!"
  • etc

More constructively:

  • I don't like the idea of "random" string operators becoming fatal, so
  • I would rather that x did a lazy thing instead
  • "Unrealistically large string repetition value" is pretty subjective. My machine has infinite amount of paper tape (in both directions)... and/or Gbs of ram/swap.
  • Heck, in the case of x, something as simple as some tie/magic with gzip or even just run-length encoding could be enough to save the memory while allowing one's program to dump out 0+[] bytes worth of y's for zip bombs or http buffer overruns or piping to apt-get

@richardleach
Contributor Author

Thanks for the feedback, @guest20. I can see that the discussion might have to balance what people could conceivably do with improving the guardrails around the patterns of usage we can observe on CPAN / other public code repositories.

  • "Unrealistically large string repetition value" is pretty subjective. My machine has infinite amount of paper tape (in both directions)... and/or Gbs of ram/swap.

More than SIZE_MAX >> 2 usable RAM/swap and you would be happy using it?

  • Heck, in the case of x, something as simple as some tie/magic with gzip or even just run-length encoding

Yes, the user doing something funky with magic is definitely worth considering.

This PR is checking for a right operand that is CONST (and implicitly that won't have any magic attached). The left operand would therefore have to have magic attached.

Perhaps the code should check for a CONST left operand too? If we did that, it might not cover as many cases, but we could be sure of no magic - in which case, there's also a threshold of IV_MAX, because that's the biggest count that pp_repeat (currently) supports.

@guest20

guest20 commented Aug 11, 2025

More than SIZE_MAX >> 2 usable RAM/swap and you would be happy using it?

Yes. And more specifically:

That scene from emperors new groove, in which Kuzco is llama lashed to the back of a log, about to fall over a waterfall and he says "bring it on"

Yes, the user doing something funky with magic is definitely worth considering.

No, what I mean is that x† could do the magic, to give back a "lazy string"... so a caller could =~, substr, utf-8 length etc without it needing to allocate / consume all my ram‡

This kind of lazyboi could even be suitable for constant folding, since it has a known truthyness at compile time ("" x ... on one side, or ... x 0 on the other)

__
†. the everything operator?
‡. though I don't mind if perl uses all my memory, it's mostly just being used to maintain hundreds of firefox tabs

@book
Contributor

book commented Aug 11, 2025

I like the fact that Perl gives you enough rope to shoot yourself in the foot.

I also don't think this patch solves either of the issues mentioned in the tickets linked in the commit (#13793 and #20586).

@richardleach
Contributor Author

More than SIZE_MAX >> 2 usable RAM/swap and you would be happy using it?

Yes.

Care to share details of the platform you're running on to help me understand use cases better?

No, what I mean is that x† could do the magic, to give back a "lazy string"... so a caller could =~, substr, utf-8 length etc without it needing to allocate / consume all my ram‡

Ah, not magic in the SvMAGICAL sense, instead a redesign of the repetition operator into some kind of iterator-thing?

Last time I grepped CPAN for this, there was definitely usage where it seemed like people would expect the whole string - e.g. for preparing a buffer or some kind of initialization - and not some iterator behaviour. Maybe that would be more of a feature request for a separate operator?

though I don't mind if perl uses all my memory, it's mostly just being used to maintain hundreds of firefox tabs

That's what I use my RAM for.

@richardleach
Contributor Author

I also don't think this patch solves either of the issues mentioned in the tickets linked in the commit (#13793 and #20586).

What would you be looking for to resolve those tickets or declare them "wontfix"?

Both of those tickets are about constant folding producing huge strings, possibly in rarely-taken or even never-taken branches, and that memory use being undesirable. The options suggested there seemed to be:

  • Leave the existing behaviour alone, people get to shoot themselves in the foot and enjoy it.
  • Don't constant fold above a certain threshold [which is what this PR does for a CONST right operand at compile time - happy to warn instead of croak though]
  • Some kind of lazy constant folding at run time the first time a branch is encountered. (Not sure what this would look like on a threaded build.)

@richardleach
Contributor Author

(I've pushed a commit changing the DIE to a warning, in case that's helpful to the discussion.)

@richardleach richardleach force-pushed the hydahy/const_fold_repeatmax branch from 033a494 to fdc3bd8 Compare August 11, 2025 23:18
For discussions on Perl#23561.

perl -e 'use warnings; my $x = ($_) ? "A" x (2**62) : "Z"'

gives this on blead for me:
```
Out of memory!
panic: fold_constants JMPENV_PUSH returned 2 at -e line 1.
```

on the previous commit, it would die:
```
Unrealistically large string repetition value"
```

With this commit, it just warns:
```
Unrealistically large string repetition value at -e line 1.
```

but will blow up if the repetition OP does get executed:
```
Out of memory in perl:util:safesysrealloc
```
@richardleach richardleach force-pushed the hydahy/const_fold_repeatmax branch from fdc3bd8 to 8fa7f8e Compare August 11, 2025 23:39
@guest20

guest20 commented Aug 11, 2025

Ah, not magic in the SvMAGICAL sense, instead a redesign of the repetition operator into some kind of iterator-thing?

Well, to maintain back-compat it'd have to be full of tie-magic so the caller got the corresponding face full of bytes when they print or maybe . the lazy object, etc... I was thinking of it as more of a Promise, but you might be right, it's closer to an iterator

@tonycoz
Contributor

tonycoz commented Aug 14, 2025

I also don't think this patch solves either of the issues mentioned in the tickets linked in the commit (#13793 and #20586).

#20586 was already addressed by #20595.

I think the OP of #13793 is addressed generally by copy on write as modified by #20595.

This change does address the point @iabyn makes in #20586 (comment) where constant folding of the repetition operator can result in large SVs that last the lifetime of the program, even if they're unused (something I've had to workaround occasionally).

The original error from #13324 has been addressed by integer overflow checks added over the years:

$ perl -le '$x = "abcd"; $y = $x x 0x3FFF_FFFF_FFFF_FFFF'
panic: memory wrap at -e line 1.

and by the Out of memory you get when trying to allocate beyond the memory available:

$ perl -le '$x = "abcd"; $y = $x x 0x2FFF_FFFF_FFFF_FFFF'
Out of memory!

though the error reporting could probably be better.
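
For readers unfamiliar with this kind of guard, here is a minimal sketch of an overflow-checked size computation for a repeated string. It is not Perl's actual check: `repeat_alloc_size` is a hypothetical helper, and the real checks in the interpreter may account for extra padding or rounding, so this sketch is not expected to reproduce exactly where the `panic: memory wrap` above is raised.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Perl API.
 * Returns 1 and stores len * count + 1 (room for a trailing NUL) in *out
 * when that value fits in size_t; returns 0 if it would wrap around. */
static int repeat_alloc_size(size_t len, size_t count, size_t *out)
{
    /* Divide instead of multiplying so the test itself cannot overflow. */
    if (len != 0 && count > (SIZE_MAX - 1) / len)
        return 0;
    *out = len * count + 1;
    return 1;
}
```

A request that passes this test can still fail later with an out-of-memory error, which matches the two distinct failures shown above: one for arithmetic wrap and one for an allocation the system cannot satisfy.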

@guest20

guest20 commented Aug 14, 2025

CW: my knowledge of guts is all from rumour and hearsay (possibly also from heresy)

@tonycoz What issue does the thing being in ... constant folded space... cause that the same large string being in regular-old scalar wouldn't cause?

@richardleach
Contributor Author

The original error from #13324 has been addressed by integer overflow checks added over the years

I still like the idea of at least warning at compile time that the repetition is likely to be fatal.

}
} else {
NV rhs = 0.0; rhs = SvNV_nomg(constsv);
if (rhs >= (NV)((SIZE_MAX >> 2) +1) ) {
Contributor

Should this be a -1.0 for safety? It can be really questionable what the "53rd digit" to the right side of the . is, and what CC, what CPU, what OS, which security or Spectre patch for your OS and CC, and make month and year of all 4 please.

I don't trust C's float/double keyword's rounding modes at all. Were the constant-folded intermediate values computed with real FP CPU instructions or C abstract machine instructions, and were the calculations done at 32, 64, 80 or 128 bit intermediate floating point precision?

The goose has been cooked at malloc(2 GB) or malloc(2GB-1 byte) either way. That's not a future bug ticket.

Contributor

Also don't forget that Intel 64/AMD 64 in 64 bit mode CPUs are incapable of doing 80 bit floating point intermediate math unlike 32 bit mode. So >= 2^53 or >= 2^52 starts introducing more and more "error" or rounding into the math formula, and we have a 64 bit memory space on paper (more like 48 bits unless you're on a rack server of brand new Xeons, which I think finally took another chomp at the AMD64 ISA's central address space gap).
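
One narrow point on the precision concern can be checked in isolation. Because `(SIZE_MAX >> 2) + 1` is a power of two (2^62 with a 64-bit `size_t`, 2^30 with a 32-bit one), converting it to `double` is exact. The sketch below uses plain `double` in place of Perl's `NV`, and the helper names are hypothetical. It only shows that the threshold itself is representable; it does not settle whether a safety margin is wanted.

```c
#include <stddef.h>
#include <stdint.h>

/* (SIZE_MAX >> 2) + 1 is a power of two, so this conversion is exact. */
static double unrealistic_threshold(void)
{
    return (double)((SIZE_MAX >> 2) + 1);
}

/* Mirrors the shape of the comparison in the diff: rhs >= threshold.
 * Assumption: double stands in for Perl's NV type. */
static int nv_count_is_unrealistic(double rhs)
{
    return rhs >= unrealistic_threshold();
}

/* Builds 2^exp by repeated doubling, which is exact for doubles
 * as long as the result stays within range. */
static double two_to_the(unsigned exp)
{
    double v = 1.0;
    while (exp--)
        v *= 2.0;
    return v;
}
```

By contrast, with a 64-bit `size_t` the value `SIZE_MAX >> 2` without the `+ 1` is 2^62 - 1, which needs 62 significant bits and would round to 2^62 under the default round-to-nearest mode, so the `+ 1` form is the one that avoids a rounding step.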

@@ -193,6 +193,14 @@ fresh_perl_like(
eval q{() = (() or ((0) x 0)); 1};
is($@, "", "RT #130247");

# [GH #13324] Perl croaks if a string repetition seems unsupportable
fresh_perl_like(
'use warnings; my $x = "A" x (2**99)',
Contributor

can we try some maximum toxic 0.0 NV literals here? maybe creating them with PP pack and or unpack.

@tonycoz
Contributor

tonycoz commented Aug 17, 2025

What issue does the thing being in ... constant folded space... cause that the same large string being in regular-old scalar wouldn't cause?

The problem is if I have code like:

sub f ($x) {
  if ($Config{usually_true}) {
    ...
  }
  else {
    ... "abc" x 50_000_000 ...
  }
}

The "abc" x 50_000_000 is constant folded into a 150MB SV and kept in the OP[1] tree even though that SV might never be used.

Without constant folding the 150MB SV would only be created when that else runs.

It is possible to workaround this, since perl's optimiser doesn't propagate assignments into constant folding[2]:

  my $abc = "abc"; # set but never modified
  ...  $abc x 50_000_000 ...  # not constant folded

I expect this type of very large constant folded string to be rare, but it can be a surprise if you're monitoring your process memory usage (or trying to diagnose memory issues once they happen.)

An alternative might be to warn on large constant folds but that seems like a "it hurts when I do that (constant fold to large strings)" rather than a solution (don't do that).

[1] later moved to the pad in threaded builds, and duplicated on thread creation
[2] I'll have to update some tests if we ever change that

@bulk88
Contributor

bulk88 commented Aug 19, 2025

What issue does the thing being in ... constant folded space... cause that the same large string being in regular-old scalar wouldn't cause?

The problem is if I have code like:

sub f ($x) {
  if ($Config{usually_true}) {
    ...
  }
  else {
    ... "abc" x 50_000_000 ...
  }
}

The "abc" x 50_000_000 is constant folded into a 150MB SV and kept in the OP[1] tree even though that SV might never be used.

Without constant folding the 150MB SV would only be created when that else runs.

It is possible to workaround this, since perl's optimiser doesn't propagate assignments into constant folding[2]:

  my $abc = "abc"; # set but never modified
  ...  $abc x 50_000_000 ...  # not constant folded

I expect this type of very large constant folded string to be rare, but it can be a surprise if you're monitoring your process memory usage (or trying to diagnose memory issues once they happen.)

An alternative might be to warn on large constant folds but that seems like a "it hurts when I do that (constant fold to large strings)" rather than a solution (don't do that).

[1] later moved to the pad in threaded builds, and duplicated on thread creation
[2] I'll have to update some tests if we ever change that

Is there any API rule that would not allow rewriting threaded perl's PL_curpad[PL_op->op_targ] SV* marked with SvREADONLY_on and SvPOK_on flags, while inside the runloop?

Why not constant fold it the 1st time it executes, then leave it forever in the OP tree? But if it's not BEGIN {} time, those PP CV*s are process eternal. Nobody "tree shakes", "re-compiles" subroutines, or chainsaws useless CV*s and GV*s from their own package, or any other packages/.pms in their perl process.
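
The "fold on first execution" idea can be pictured, outside of Perl, as a lazily built and then cached buffer. This is only an analogy in plain C: `lazy_repeat` and `lazy_repeat_get` are hypothetical names, nothing here touches the pad or the OP tree, and the sketch is deliberately not thread safe.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an op that owns a lazily built constant. */
typedef struct {
    const char *unit;   /* string to repeat */
    size_t count;       /* repetition count */
    char *folded;       /* NULL until the op first executes */
} lazy_repeat;

/* Builds the repeated string on first use and returns the cached copy on
 * later calls. Returns NULL on size overflow or allocation failure. */
static const char *lazy_repeat_get(lazy_repeat *op)
{
    if (op->folded == NULL) {
        size_t len = strlen(op->unit);
        if (len != 0 && op->count > (SIZE_MAX - 1) / len)
            return NULL;                    /* len * count + 1 would wrap */
        size_t total = len * op->count;
        char *buf = malloc(total + 1);
        if (buf == NULL)
            return NULL;
        for (size_t off = 0; off < total; off += len)
            memcpy(buf + off, op->unit, len);
        buf[total] = '\0';
        op->folded = buf;                   /* stays allocated from now on */
    }
    return op->folded;
}
```

Nothing is allocated if the branch never runs, but once it has run the buffer is held until something explicitly frees it.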

I think it might be technically impossible to really delete CV*s and GV*s from your address space since = barewordsubs(); are burned into the optree, so gv = ; and delete ${'???::'}->{gv}; still won't make the CV*/GV* drop to 0 RC.

I've done experiments with hv_common() and pp_*()s upgrading incoming PAD SV* keysv args (not the char* key) from COW255/NEWX to HEK* READONLY but 30%-60% of cases don't work, because that particular pp_*(), and the SV* in its TARG slot that it sends to hv_common(), is a 2nd or 3rd generation my $var that gets wiped between every call to that pp_*().

my @arr = qw( FIELD1 FIELD2 FIELD1 );
if (exists $self->{shift(@arr)}) { # no chance of hv_common() upgrading this to a HEK*
   1;
}
if (exists $self->{$arr[0]}) { # no chance of hv_common() upgrading this to a HEK*
   1;
}

It has to be done at the toke.c/op.c level for the HEK* to last to the next entry to this sub.

SV *constsv = cSVOPx_sv(cBINOPo->op_last);
UV arbitrary = 1024 * 1024;

if (SvIOKp(constsv)) {
Contributor

Why check SvIOKp(constsv) instead of SvIOK(constsv)? Doesn't p mean it's a low resolution rounded answer and not the official value of the SV*? Why or why not (52 bits vs 32 bits, 52 bits vs 64 bits) check for NVf*/NVp flags first?


if (SvIOKp(constsv)) {
if (SvIOK_UV(constsv)) {
if (SvUVX(constsv) > SIZE_MAX >> 2)
Contributor

Shouldn't this be >>3 for 64b cpus and >> 2 for 32 bit cpus?

C:\sources>perl -E" say sprintf('%x', 0xFFFFFFFF>>2);"
3fffffff
C:\sources>perl -E" say  0x3fffffff"
1073741823
C:\sources>

so 1 GB string limit for i386 procs? that is too high.

I would set that to 512 MB or 256MB. Win32 on i386's user mode contiguous free linear address space is only 1.2 GB.

So following smartphone/browser rules, the guideline nowadays is that 1 process, 1 tab, or 1 cloud VM gets 75% of "something" or 25% of "something" before the kernel/browser kills the tab/process.

I've picked "128MB" as the arbitrary cutoff for per-App caches on 512MB-1GB-2GB phys RAM Win2000/WinXP boxes at work.

128MB is the cheapest cloud VM container you can buy AFAIK. That's another reason to pick 1/8th of the max limit, which is 128 MB.
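
To make the shift arithmetic in this comment easier to compare across word sizes, here is a small sketch that computes the `>> 2` and `>> 3` limits for 32-bit and 64-bit unsigned maxima. `shifted_limit` is a hypothetical helper for illustration only and is not part of the patch.

```c
#include <stdint.h>

/* Largest value of an unsigned integer `bits` wide (32 or 64 here),
 * shifted right by `shift` (which must be less than 64). */
static uint64_t shifted_limit(unsigned bits, unsigned shift)
{
    uint64_t max = (bits >= 64) ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
    return max >> shift;
}
```

With a 32-bit maximum, `>> 2` gives 1073741823 (just under 1 GiB, matching the output above) and `>> 3` gives 536870911 (just under 512 MiB). With a 64-bit maximum, `>> 2` gives 0x3FFFFFFFFFFFFFFF, just under 4 EiB, which is far beyond any allocation likely to succeed.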

Contributor

so 1 GB string limit for i386 procs? that is too high.

I would set that to 512 MB or 256MB. Win32 on i386's user mode contiguous free linear address space is only 1.2 GB.

Not every system is win32:

tony@venus:.../perl/git$ cc -std=c2x -m32 -obigalloc bigalloc.c
tony@venus:.../perl/git$ ./bigalloc
success psize 4 2147483647
tony@venus:.../perl/git$ cat bigalloc.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main() {
  size_t sz = 0x7fff'ffff;
  void *p = malloc(sz);
  if (!p) {
    perror("malloc");
    exit(1);
  }
  memset(p, 0, sz);
  printf("success psize %zu %zu\n", sizeof(p), sz);
}

(c2x for the c23 not-a-comma in numeric literals)

The idea isn't to set a maximum allocation limit but to complain about the very obviously bad allocations.

I don't think 1GB+ string literals on i386 are a good idea, but a careful user could use them.

@tonycoz
Contributor

tonycoz commented Aug 20, 2025

Is there any API rule that would not allow rewriting threaded perl's PL_curpad[PL_op->op_targ] SV* marked with SvREADONLY_on and SvPOK_on flags, while inside the runloop?

I think it's a possibility.

But for the large SVs we're talking about here it leads to a related problem.

If the code executes once, and the user cleans up their copy of the SV (eg undef $x which releases the PV[1]) that large PV stays allocated for the rest of the lifetime of the program.

[1] compare $x = undef vs undef $x:

$ perl -MDevel::Peek -e '$x = "abc" x 10; Dump($x); $x = undef; Dump($x); undef $x; Dump($x)'
SV = PV(0x558bed584ea0) at 0x558bed5b1130
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x558bed5a3ba0 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32
SV = PV(0x558bed584ea0) at 0x558bed5b1130
  REFCNT = 1
  FLAGS = ()
  PV = 0x558bed5a3ba0 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32
SV = PV(0x558bed584ea0) at 0x558bed5b1130
  REFCNT = 1
  FLAGS = ()
  PV = 0
